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ABSTRACT 



A relational database system has a non- volatile memory, a 
volatile memory for temporarily storing a set of data records, 
and a query processor. The volatile memory has an amount 
of available space for query processing that is segmented 
into multiple memory pages. Initially, these memory pages 
are empty and available in a pool for use by the query 
processor. The query processor establishes a partition table 
that defines multiple partitions. The query processor parti- 
tions incoming data records into the partitions according to 
a hashing function and stores the data records in memory 
pages associated with the partitions. As a new data record 
placed into a particular partition, the query processor 
attempts to aggregate the new data record with any like data 
record that already exists in the particular partition. If no like 
data record exists, the data record is stored separately on the 
memory page within the partition. In the event that a 
memory page of the partition becomes filled, the query 
processor retrieves an empty memory page from the free 
pool and assigns that empty memory page to the needy 
partition. In the event that no free memory pages are left in 
the pool, the query processor selects a memory page from 
any one of the partitions and writes the data records on the 
selected memory page to the non-volatile memory to free the 
memory page. The query processor selects the memory page 
according to selection criteria that favors output of full 
memory pages over partially filled memory pages and that 
favors memory pages with a low absorption rate. Data 
records with low activity are written to non-volatile memory 
in the interest of preserving data records with high absorp- 
tion rates on the memory pages with the hope of absorbing 
future data records. 

57 Claims, 9 Drawing Sheets 
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RELATIONAL DATABASE SYSTEM AND 
METHOD FOR QUERY PROCESSING USING 
EARLY AGGREGATION 

TECHNICAL FIELD s 

This invention relates to relational database systems. 
More particularly, this invention relates to techniques for 
processing "group by" queries in relational database sys- 
tems. 

10 

BACKGROUND OF THE INVENTION 

Relational database systems are a type of database or 
database management system that stores information in 
tables — rows and columns of data — and conducts searches 
by using data in specified columns of one table to find 
additional data in another table. Typically, the rows of a table 
represent records (collections of information about separate 
items) and the columns represent fields (particular attributes 
of a record). An example of a relational database system is 
the SQL (Structured Query Language) Server database sys- 
tem manufactured and sold by Microsoft Corporation. 

A particular type of query that is frequently used in 
relational database systems is to group records according to 
a particular parameter. A query of this type is often referred ^ 
to as a "GROUP BY" query. As an example, suppose an 
administrator wants to tally a billing total for a particular 
customer from a number of invoices. Each record contains 
a customer identification number and a dollar amount. The 
administrator might formulate a GROUP BY query that 3Q 
groups the invoices according to customer ID and then 
subsequently tallies the dollar amounts on the invoices for 
each customer. 

The data records are commonly stored on disk arrays or 
other forms of non-volatile memory. Queries performed on 35 
relational databases, such as the GROUP BY query, can be 
more easily accommodated if all of the data records are 
loaded into volatile memory (i.e. RAM) for processing. 
However, relational databases often contain large amounts 
of data, which far surpasses the volatile memory resources. 40 
As a result, records are handled in volatile memory in batch, 
and then written to disk as they are processed. 

FIG. 1 shows a traditional approach to evaluating a 
GROUP BY query, which uses sorting and aggregation 
techniques. Adata stream 20 has a series of data records that 45 
represent invoices. Each data record has a customer ID (the 
top number) and a dollar amount (the bottom number). For 
convenience, suppose that only two records can fit into 
volatile memory at a time. As a batch 22 of two records is 
loaded into volatile memory, the query process sorts the two 50 
records according to customer ID so that the highest num- 
bered customer ID is first and the lowest numbered customer 
ID is second. The sorted batch, usually called a run, is then 
written to the storage disks. The process is repeated for each 
batch of two records to produce the data set 24 storing the 55 
runs. 

Next, the query process reads the sorted runs from disk 
and merges them to produce the data set 26. The records in 
the merged data set 26 are now sorted according to customer 
ID. The query process then aggregates records with like 60 
customer IDs. Since the GROUP BY query is interested in 
a final billing total for each customer based on one or more 
invoices, all records having the same customer number can 
be aggregated and represented as a single record having the 
customer ID and a dollar value equal to the sum of the 65 
amounts in the individual records. Here, the two records for 
customer ID of 60 are combined into a single record having 



2 

the customer ID of 60 and a tallied dollar amount of $86 
(i.e., 54+32). Similarly, the two records for customer ID of 
52 are also combined into a single record. The aggregation 
phase produces a data set 28. 

Notice that the amount of data handled throughout the 
query process remains essentially constant until the final 
aggregation stage. It would be beneficial to reduce the 
amount of data earlier in the process. One prior art technique 
for achieving this goal is known as "early aggregation," in 
which like data records found in initial batches are combined 
early on, before the records are sorted and merged. 

FIG. 2 shows a conventional query process, which utilizes 
an early aggregation technique. Batch 30 of the data stream 
20 contains two records with the same customer ID of 60. 
Using an early aggregation process, the records are com- 
bined in volatile memory to form a single record with a 
summed dollar value. The result of the early aggregation 
process is a reduced data set 32, which has one less record 
than stream 20. The reduced data set 32 is then processed 
according to the same sorting, merging and aggregating 
processes described above for FIG. 1 to produce data sets 34, 
36, and 38, respectively. 

Early aggregation reduced the number of data. records 
handled during the query process. In this simple example, 
only one record is eliminated early on; however, significant 
savings can be achieved when processing large numbers of 
data records. 

A separate technique that is often used in large-scale 
queries is known as "hash partitioning." This technique 
involves hashing the data records according to a query 
parameter to put the records into smaller work files. The 
early aggregation, sorting, and merging can then be applied 
to the smaller work files. 

FIG. 3 shows a conventional query process, which uses in 
part a hash partitioning process to place the records in 
smaller work groups. In FIG. 3, the data records are parti- 
tioned into one of five buckets according to the last digit in 
the customer ID. For instance, bucket 0, 5 is for all records 
in which the last digit of the customer ID is a zero or five. 
Each bucket has a dedicated piece of memory sized to hold 
two records. 

The records are placed in the appropriate buckets accord- 
ing to the hashing function. Record 52, 10 is placed in 
bucket 2, 7 because the customer ID of 52 has a last digit of 
2; record 77, 44 is placed in bucket 2, 7; and record 65, 12 
is placed in bucket 0, 5. The next record 52, 30 has the same 
customer ID as the first record. An early aggregation process 
can be used at the bucket level to aggregate this record with 
the earlier record. 

FIG. 4 shows the result of placing the record 52, 30 into 
bucket 2, 7. The records 52, 10 and 52, 30 are combined to 
form a single record 52, 40, which remains in the bucket 2, 
7. Similarly, the next two data records 60, 54 and 60, 32 are 
combined in bucket 0, 5 to form a single record 60, 86. 
Notice that both buckets 0, 5 and 2, 7 are full. When a next 
record is received that cannot be aggregated with an existing 
record, an overflow condition is reached. To free up memory 
space in the bucket, one of its records is written to an output 
buffer. When the output buffer becomes full, its records are 
posted to a temporary file on the storage disk. Typically, the 
bucket memory operates on a first in, first out (FIFO) policy, 
and hence the first data record written into the memory is the 
one selected to be written out to the output buffer when the 
memory becomes full. 

FIG. 5 shows the effect of adding a record to each of the 
buckets 0, 5 and 2, 7, thereby causing an overflow condition 
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for each bucket. The record 65, 12 from bucket 0, 5, along high absorption rates. In one particular implementation, the 

with the record 52, 40 from bucket 2, 7, are written to the query processor selects the memory page based upon how 

output buffer. The records remain in the output buffer until recently the data records have been updated by being 

that buffer is full. Once full, the entire contents of the output aggregated with incoming data records. Data records with 

buffer file are flushed to a temporary file on the storage disk 5 low aggregation activity are written to non-volatile memory 

to free the output buffer for more overflow records. in the interest of preserving data records with high absorp- 

FTG. 6 shows an alternative construction in which an uon rates on the memory pages in the hope of absorbing 

output buffer and associated temporary file on disk are future data records. 

allocated for each bucket. In this structure, the overflow The query process distinguishes between "unsafe" and 

records are posted to the output buffer associated with the 10 "safe" records stored in memory. A record is "unsafe" if it 

bucket. As the output buffer for a corresponding bucket was created after its hash partition spilled its first page to the 

becomes full, its contents are written out to the associated associated temporary file. Otherwise, it is "safe". When the 

temporary file on disk. end of the data record series is reached, all safe records can 

When the system reaches the end of the input data stream, be output directly without further processing. On the other 

the system repeats the aggregation process on the temporary 15 hand, all unsafe records are processed further because they 

files to further combine data records. The process is com- might subsequently merge with other records previously 

pleted when all combinable data records are aggregated in written to non-volatile memory. The query process can also 

main memory to produce a final output. be configured to distinguish between "unsafe" and "safe" 

While these conventional techniques are effective, they do memorv P a S es - A memory page is considered "unsafe" as 

not fully capitalize on the benefits of early aggregation or 20 soon as it contaxns one unsafe record; otherwise, His a "safe" 

efficiently use the allocated memory within the partitions. memory page. 

The inventor has developed a query process and data struc- nFSrRTPTTON OF THF DRAWINGS 

ture that overcomes these drawbacks. BRIEF DESCRIPTION Or VHh DRAWINGS 

SUMMARY OF THE INVENTION 25 FIG - * is a j^L™ ° f 3 P rio ^ r W 

process for evaluating a GROUP BY query, which uses 

This invention concerns a relational database system and sorting and aggregation processes, 

method for processing queries in a manner that optimizes FIG. 2 is a diagrammatic illustration of the FIG. 1 query 

early aggregation in the context of hash partitioning to process> which includes an early aggregation phase, 

minimize the number of I/O operations and that efficiently . . , t1 ... t , , , 

iM . , L1 f , r u 30 FIG. 3 is a diagrammatic illustration of a prior art hash 

utilizes memory available for the query process. .... ^ c j ■ * « i 

7 . partitioning process of segregating records into smaller work 

According to one implementation, a relational database fileg QT buckets 

system has a non-volatile memory, a volatile memory for „ . . ' . .„ t . P - , . 

temporarily storing a set of data records, and a query ™?. 4 15 a diagrammatic lUustraUon of the FIG. 3 hash 

processor. The volatile memory has an amount of available 35 Phoning process which further shows an early aggrega- 

r r 4 j - * i*- i tion phase within the buckets, 

space for query processing that is segmented into multiple yua^ nu 

memory pages. Initially, these memory pages are empty and PIG- 5 is a diagrammatic illustration of the FIG. 3 hash 

available in a pool for use by the query processor. partitioning process, which shows an overflow condition in 

The query processor establishes a partition table that which records are written 10 a temporary file, 

defines multiple partitions. The query processor hashes 40 FIG. 6 is a diagrammatic illustration of the FIG. 3 hash 

incoming data records to an entry in the partition table. The partitioning process, which shows an overflow condition in 

entry directs the data records to the appropriate partitions. which records are written to temporary files associated with 

One memory page is initially assigned to each partition or their buckets. 

alternatively taken from the free pool as needed by the FIG. 7 shows a relational database computer system, 

partition. The data records are stored in the memory pages 45 pj G g ^ a diagrammatic illustration of data structures 

associated with the partitions. Before a new data record is used ^ a q Uery processing technique according to an exem- 

placed into a particular partition, the query processor p i ary implementation of this invention. FIG. 8 shows the 

attempts to aggregate the new data record with any like data data structures at an initial point in the query process, 

record that already exists in the particular partition. In one piG 9 shows ^ F[G g ^ stmctures at a point in the 

implementation, a separate, much larger, hash table is uU- 50 query process subsequent to that shown in FIG. 8. FIG. 9 

lizedforquicklookupofanylikedatarecords.I^ jJlu^tcs Uon oftwo data rccords . 

record exists, the incoming data record is stored separately •* • . 

on one of the memory pages associated with the partition. HO. 10 shows the FIG. 8 data structures at a point in the 

_ . 4 c*u u ~ query process subsequent to that shown in FIGS. 8 and 9. 

In the event that a memory page of the partition becomes £ inn*-** * e e 

£ 4 . ;. r ^ r , FIG. 10 illustrates assignment of free memory pages to 

full, the query processor retrieves an empty memory page ss . fi . .... & ' * 6 

from the free pool and assigns that empty memory page to P re e Q P a 1 0DS * 

the needy partition. In the event that no free memory pages FIG - U shows the FIG. 8 data structures at a point in the 

are left in the pool, the query processor selects a memory °i uer y process subsequent to those of FIGS. 8-10. FIG. 11 

page from any one of the partitions and writes the data illustrates how memory pages in one partition are emptied 

records on the selected memory page to the non-volatile 60 md reassigned to another partition. 

memory to free the selected memory page. The selected DETAILED DESCRIPTION OF THE 

memory page is appended o a temporary file associated PREFERRED EMBODIMENT 

with the partition to which it belongs. 

The query processor selects the memory page according FIG. 7 shows a relational database system 50 having a 
to selection criteria that favors output of (1) full memory 65 computing unit 52 and a non-volatile memory 54 interfaced 
pages over partially filled memory pages and (2) memory with the computing unit 52. The computer 52 has a pro- 
pages with low absorption rates over memory pages with cessing unit 56 and a main memory 58. The main memory 
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56 is volatile and can be implemented, for example, as 
volatile RAM (Random Access Memory). The non -volatile 
memory 54 provides permanent storage for relational data- 
base records. The non-volatile memory 54 can be imple- 
mented in a variety of ways, including disk arrays, disk 
drives (e.g., hard and floppy), read/write CD ROMS, tape 
backups, reel-to-reel, and the like. 

The relational database system 50 is shown in an opera- 
tional state in which a relational database program 60 is 
loaded in main memory 58 for execution on the processing 
unit 56. The relational database program 60 is permanently 
stored on non-volatile memory 54 and loaded into the main 
memory 58 when launched. An example of a relational 
database program is the SQL Server program sold by 
Microsoft Corporation. It is also noted that aspects of this 
invention concerning query processing may be used in other 
types of programs that may employ relational database 
concepts, such as spreadsheet programs, accounting 
software, workflow management software, and the like. 

The relational database program 60 has a query process- 
ing program 66 that implements aspects of this invention. 
The query processing program 66 utilizes hash partitioning 
to separate records into partitioned work files that are more 
easily managed. A partition table 62 is constructed by the 
query processing program 66 to contain entries identifying 
the various partitions. Data records are hashed to a particular 
entry in the partition table 62 and then placed in the partition 
referenced by that entry. 

A portion of the main memory 58 is available for query 
processing. Preferably, the available memory is divided into 
fixed-size memory pages 64 that are utilized in the partitions 
to store the data records during processing. In this manner, 
the partitions are formed as one or more memory pages. 

The relational database system 50 is representative of 
many diverse implementations, including a standalone 
computer, a database server for a network of PCs or 
workstations, an online server for Internet service providers, 
a mainframe computing system, and the like. The relational 
database system 50 runs on top of an operating system (not 
shown), which is preferably a multitasking operating system 
that allows simultaneous execution of multiple applications 
or multiple threads of one or more applications. Examples of 
suitable operating systems include a Windowsl® brand 
operating system sold by Microsoft Corporation, such as the 
Windows NT® workstation operating system, as well as 
UNIX based operating systems. 

FIG. 8 shows how the relational database program 60 
processes a GROUP BY query on a series of data records 70. 
In this example, the data records 70 represent invoices 
having a customer ID and a dollar amount. The series of data 
records is loaded sequentially into main memory 58. The 
data records 70 can originate from a number of different 
sources, such as from permanent files on the non-volatile 
memory 54, or from an executing application, or from a 
remote source on a network. For purposes of this discussion, 
suppose an administrator formulates a GROUP BY query 
that groups the invoices according to customer ID and then 
subsequently tallies the dollar amounts in the invoices for 
each customer to produce a final billing record having a 
customer ID and total dollar amount owed. 

The relational database program 60, and more particularly 
the query processing program 66, creates a hash or partition 
table 62 having a number of entries that reference corre- 
sponding partitions. In this example, the hash function 
partitions the data records according to the last digit in the 
customer ID into five partitions or buckets. The five entries 
in partition table 62 identify the five buckets. 
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The memory pages 64(1)-64(N) are initially empty and 
placed in a free list 72. In this example, the memory pages 
are sized to hold two records, although in practice the 
memory pages will hold many records. The memory pages 

S are taken from the free list 72 and assigned to the appropriate 
bucket as the incoming data records warrant. Alternatively, 
the query processing program initially assigns at least one- 
memory page for each bucket, as is shown in FIG. 8. The 
entries in the partition table 62 reference the chain of one or 
more memory pages 64 in the buckets.- The entries may 
contain a pointer to a particular memory location, or some 
other means for identifying the chains of memory pages. 

The query processing program 66 also creates a separate, 
much larger, hash table 68 to enable fast lookup of matching 

15 records. The match lookup table 68 has many more entries 
than partition table 62 and is accessed using a different 
hashing function than the function used for the partition 
table. Each entry contains a chain of pointers to individual 
data records in the memory pages 64. Unlike the partition 

20 table, many entries in the match lookup table 68 pertain to 
records on the same memory page. The number of data 
records referenced by each entry is thus significantly less 
than the number of data records in a partition referenced by 
the partition table 62. In this manner, the query processing 

25 program 66 can quickly lookup through the match hash table 
68 to determine whether the data record matches an existing 
data record. 

As each data record 70 is received, the query processing 
program 66 applies the first hashing function I to hash the 
30 data record according to its query parameter (i.e., grouping 
column), which in this case is the customer ID, to an entry 
in the match lookup table 68. In this example, there are no 
records yet stored in the memory pages, and hence no match 
will exist. 

35 The query processing program 66 then hashes the data 
record using a second hashing function II to an entry in the 
partition table 62. The first data record 52, 10 is hashed to 
an entry 74 in the partition table 62. Many different kinds of 
hashing functions can be used for. each table. Preferably, the 

40 hashing function used for the partition table is designed to 
provide an even distribution of records among the partitions. 
One preferred hashing function is described in an article by 
Per-Ake Larson entitled Dynamic Hash Tables, Communi- 
cations of the ACM, 31, 4 (1988), pg. 446-457. The partition 

45 table entry 74 contains a reference to partition 2, 7. As a 
result, the first data record 52, 10 is placed in the memory 
page 64(3) of bucket 2, 7. The first data record fills in one 
of the two slots on the memory page 64(3). A reference to 
the record is also added to the match lookup table 68. 

50 The relational database program 60 continues to process 
the incoming data records 70. As more data records are 
processed, the query processing program 66 attempts to 
aggregate like data records within the partitions, The query 
processing program 60 hashes each record to an entry in the 

55 match lookup table 68 and compares the query parameter 
(i.e., customer ID) with those of the existing data records 
referenced by the table entry in an attempt to find a match. 
If a match is found, the aggregation columns (i.e., the dollar 
amount) are updated (in this case summed) immediately, and 

60 the query processing program continues with the next record 
without use of the partition table. If no match is found, 
however, the query process program 66 hashes the data 
record to an entry in the partition table and the data record 
is placed by itself into a memory page of the appropriate 

65 partition. 

FIG. 9 shows the same structure of FIG. 8, but at a point 
when two records are aggregated. When the fourth data 
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record 52, 30 is reached, the query processing program 66 theory that data records with low absorption rates (meaning 

hashes the record according to the hashing function I to an that they have not often aggregated with like data records) 

entry 75 in the match lookup table 68. In this example, the are less likely, if left in the buckets, to aggregate with 

customer ID 52 matches an existing record. Accordingly, the incoming data records. Hence, these data records should be 

query processing program 66 aggregates the two records in 5 written to non-volatile memory in the interest of preserving 

memory page 64(3) of bucket 2, 7. The combined records data records with hi 8 h absorption rates (meaning that they 

have summed dollar amount of 40 (i.e., 10+30). When a < end to aggregate more often) on the memory pages in the 

match is found, the query processing program proceeds to £>P e of absorbing a higher percentage future data records, 

the next data record. The match lookup table speeds up the 15 t0 ^gate data records as much as possib e 

4 r „ ♦■* *u,o while the records are in the memory pages of volatile 

aggregation process in comparison itc » attempting tc , match 10 m ft reduces ^ numhG /Jy% s used duri 

through the partition table (although the partition table may ^ que J fy processing 

be used for match lookup). ^ scUaian process can be im pi em eQted to examine 

Eventually, the memory pages become full. An overflow absorption rates of individual data records or of pages (i.e., 

condition results when the query processing program m average of multiple data records). Once the relational 

attempts to add a new data record to a full memory page. 15 database program finds the data record or page that has the 

When an overflow condition occurs, the relational database lowest absorption rate, the relational database program 

program appends a memory page from the free list 72 to the writes the entire memory page out to the non-volatile 

appropriate partition and links the memory page to the memory. 

existing chain for that partition. ^ alternative is to select individual data records from 

Note that the subsequent explanation assumes that the 20 one of the partitions according to the selection criteria, copy 

match lookup table is still used even though it is not those individual records to an output buffer, and write the 

explicitly shown in FIG. 10 and FIG. 11. (full) buffer to the temporary file associated with the parti- 

FIG. 10 shows the same structure of FIGS. 8 and 9, but tion. The drawback with this approach is that the relational 

after an overflow condition is reached. The data record 75, database program must set aside memory for the output 

25 in the series 70 hashes to an entry 76 in the partition table buffer that could otherwise be used for early aggregation in 

62, which references bucket 0, 5. The query processing the buckets. 

program 66 attempts to aggregate the new data record 75, 25 The memory page (or set of individual records) selected 

with an existing record, but there is no match. Accordingly, as the victim can be taken from any partition, regardless of 

the query processing program will store the new data record 3Q which partition needs the emptied memory page. That is, the 

75, 25 individually. However, the memory page 64(1) is full relational database program need not empty a memory page 

and hence an overflow condition results. When the overflow in the specific bucket into which the next data record will go. 

condition occurs, the relational database program appends a Rather, the relational database program can empty a memory 

new memory page 64(6) to the full memory page 64(1) in page in another bucket and then transfer that empty memory 

bucket 0, 5. The data record 75, 25 is then placed in the new 3S page from its original bucket to the needy bucket to receive 

memory page 64(6). the incoming data record. 

This process of filling the partitions continues until there Suppose in FIG. 11, for instance, that memory page 64(5) 

are either no more free memory pages or no more input is selected as having the lowest absorption rate. The rela- 

records. If the series of data records 70 runs out before the tional database program empties the memory page 64(5) to 

available memory space is exhausted, the relational database 40 a partition file in the non-volatile memory. There is prefer- 

program simply outputs all data records in the memory ably one permanent partition file for each partition in main 

pages to the non-volatile memory and the process is finished. memory. Writing the data records on memory page 64(5) to 

All of the data records are grouped and aggregated, and the non-volatile memory frees the memory page 64(5) for use in 

results can be presented to the administrator. On the other the same or different partition. In this example, the empty 

hand, if the available memory space is exhausted before the 45 memory page 64(5) is moved to a different bucket 3, 8. The 

data records run out, the relational database program must incoming data record 43, 69 can then be placed in the 

empty one or more memory pages by writing the contents transferred memory page 64(5). 

out to non-volatile memory to create additional free space. It is noted that all memory pages in a partition can be 

FIG. 11 shows the same structure of FIGS, 8-10, but after moved to other partitions if the conditions warrant. For 

many data records have been partitioned. All of the memory 50 example, suppose that the query processing program selects 

pages 64(1)-64(11) have been assigned to one of the parti- memory page 64(2)— the only page in bucket 1, 6— as the 

tions. The next data record 43, 69 in the series 70 is hashed victim. Page 64(2) would be emptied to non-volatile 

to entry 78 of the partition table 62. Entry 78 references memory, cleaned, and transferred to bucket 3, 8. This 

bucket 3, 8. Here, both of the memory pages 64(4) and 64(8) process would leave no memory pages in bucket 1,6. This 

in the bucket 3, 8 are filled. In addition, there are no more 55 situation, however, is not a problem since the next record 

memory pages in the free list (not shown). In this situation, directed to the bucket 1, 6 would cause an overflow condi- 

the relational database program must empty a page to tion for that partition, and the query processor would free up 

accommodate the next data record. The issue is which a memory page from elsewhere and move the emptied 

memory page should be written to the non-volatile memory. memory page to the bucket 1, 6. 

According to an aspect of this invention, the relational 60 To overlap writing to partition files on the non-volatile 

database program selects a victim memory page according memory and processing incoming data records, an imple- 

to a selection criteria that favors output of (1) full memory mentation variation is to maintain a selected number of 

pages over partially filled memory pages and (2) memory memory pages on the free list, rather than allowing the free 

pages with low absorption rates over memory pages with list to exhaust completely. In this manner, an additional 

high absorption rates. The first criteria attempts to optimize 65 memory page is always available from the free list, 

the I/O resources by outputting a memory page with more The relational database program is described above as 

data, rather than less. The second criteria is based on the writing out one page at a time. If the pages are small, the 
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program might write out multiple pages at a time to further The safe-unsafe concept can be applied to memory pages 

reduce I/O processes. This technique is particularly advan- instead of records. A memory page starts out being safe. As 

tageous when incorporated into a system that supports soon as lt receives an unsafe record (a record created after 

gather-write and scatter-read. However, all pages that are the part ition has spilled), the page changes state to unsafe, 

written out should belong to the same partition, which makes 5 ^ reachi ^ end Qf ^ . daU $tre ^ safe 

selecting the set of pages more complex. pages can be sent to the final output while unsafe pages have 

Selecting memory pages based on absorption rates is tQ bc wriUcn b thc rtition fik 
beneficial as it optimizes the opportunity for early aggrega- 
tion in the main memory. This, in turn, tends to minimize the 

number of l/Os to non-volatile memory. In addition, the 10 The following discussion provides one exemplary imple- 

ability to select memory pages as victims from any partition mentation of a selection policy that attempts to select 

is beneficial as it enables preservation of memory pages with records or pages with the least absorption rate. In general, 

the higher absorption rates, regardless of where they are, to the query processing program 66 can be configured to select 

remain in the main memory. Moreover, the sharing of the data record that has beeD least recent i y aggregated 

memory pages among the various partitions is beneficial 15 ^ anQther data and tQ ^ m ^ data 

because the program can flexibly adapt to the mcoming data fecord Qr m containing that record. The rela- 

records. At one point in the senes, one partition might be »j. u J r ° ... ■» t r> a 

. f . . . * - .f_ » Ha. i_ tional database program maintains several LRA queues, 

expenencmg a high absorption rate and thus can benefit by XI r . . . . . ™ 

having a ^proportionately higher number of memory Memory pages arc assigned to the queues based on a fill 

pages; whereas at a different point in the series, another *> factor and whe,her are orw,safe - 
partition might be experiencing a high absorption rate and 

could benefit from more memory pages. The adaptability of As an example, suppose that records are fixed length and 

the query process can be advantageous for data streams that that a memory page has room for ten records. The relational 

have temporal clustering in which like records tend to be database program creates a total of twenty LRA queues (plus 

grouped together in the series, even though the series is ^ a list of free pages): one queue for any unsafe pages that 

otherwise evenly distributed. Memory sharing adapts to currently store ten records, one queue for any safe pages that 

such conditions to optimize keeping high absorptive records currently store ten records, one queue for any unsafe pages 

in main memory and outputting low absorptive records to curtly storing nine records, one queue for any safe pages 

non-volatile memory. currently storing nine records, and so on. At any one time, 

Another benefit of organizing memory according to there may be none, one, or more pages listed in any one of 

chains of pages and a free pool is that it is very easy to adjust the queues. When a memory page needs to be emptied, the 

the amount of memory used for aggregation at run time. To relational database program goes to the queues and selects a 

decrease memory usage, certain memory pages are selected victim ^ ^ same order ^ m fifSt looks in the „ 10 

as victims and output to the non-volatile memory. The space rccord> unsafc „ If Q0 fa ^ - t ds ^ ^ 

is then returned to the system. To increase memory more „ 10 ^„ md ^ {q ^ ug ^ {q „ 

memory space is requested from ^system, formatted as and then * &afe „ and so on . m 

pages, and added to the free pool. Whether and when to 2 ' , . A ./ ^. . n ? 

increase or decrease memory usage is a policy decision that P a / e found » ™ ttcn to u lts P artltl0n filc ' clca ° cd ' and 

depends on the operating environment of the system. ^ J^_^ representing different 

The relational database program tracks whether partitions 
have had memory pages written out to non- volatile memory. 

Partitions that have had one or more memory pages written This approach can be easily adapted to variable length 

out to non-volatile memory are said to have "spilled", records. For variable length records, the relational database 

whereas partitions that have not yet had any memory pages 45 program establishes a number of fill factor ranges for each 

output are said to be "unspilled." The relational database memory page and assigns two LRA queues for each fill 

distinguishes between spilled and unspilled partitions, as factor range. For example, the fill factor ranges for each page 

this information is useful upon reaching the end of the input could be: over 90% filled, 80-90% filled, 70-80% filled, and 

record stream 70. so on down to 10% filled. It is not necessary that all ranges 

FIG. 11 shows a register 80 that holds a bit for each 50 are the same size; any division into ranges will do. However, 

partition. The bit is a binary "0" when the partition is uniform ranges make it easier to determine to which queue 

unspilled and set to a binary "1" when the partition changes a p a g e belongs, 
to being spilled. In this example, the partition 4, 9 is spilled 

as a result of writing the data records of memory page 64(5) _ , . . . , , 

out to non-volatile memory. Hence, the bit for partition 4, 9 S5 To s P eed U P me marching for a victim page, one might be 

is set to while the remaining bits are "10". inclined to use a large partition table and assign multiple 

-me relational database also distinguishes between "safe" ^ ntrics to thc + sam f P arlition ^ ^ » not P rcf ^ cd > 

and "unsafe" data records. A new record is marked "safe" if however > 33 lX reS L ults 1 _ in ? 00r ™ mo ? utlllzatl0n £ and low 

it is created in an unspilled partition, otherwise it is marked absorpUon rates, thereby increasing the volume of interme- 

"unsafe". Records are marked safe or unsafe when they are 60 data ' Inst6ad > Petition table size should be deter- 

created (and do not change state thereafter). When reaching mined b y ^ nu m be r of memory pages available. In experi- 

the end of the input data stream, safe records can be sent to menls conducted by the inventor, the table size was set to the 

the final output because no more aggregation of those number of pages divided by five, with an upper bound of 50. 

records is possible. Unsafe records are written to the parti- This achieves an intermediate data volume close to mini- 

tion file in non-volatile memory for further processing 65 mum. To increase search speed, it is more desirable to 

because these data records might still be aggregated more maintain a separate, much larger, hash table that is used 

with data records previously written to the partition file. strictly for lookup. 
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The following is an example of pseudo-code used to 
implement the query process. FS is a list of files to be 
processed. 



add input file of data records to FS; 
while FS is not empty do 

select an input file F from FS; 

Initialize data structures and create partition files; 

while F is not empty do 

get next input record from F; 

locate existing group record with matching values on the 

grouping columns; 
if no matching record is found then 

determine which partition P the input record belongs to by 

hashing on the grouping columns; 
locate free space for a new group record on one of the 

pages belonging to partition P; 
if no free space is found then 
if there are no free pages then 

select a set of pages with low absorption rate and all 

belonging to the same partition, to output; 
write the selected pages to the corresponding 

partition file; 
clean the pages and add them to the free list of 



-endif 

get a free page and attach it to partition P; 
locate free space for a group record on the newly added 
page; 

endif 

initialize a group record matching the input record; 

if the partition file for partition P is still empty, mark the 

new record as being safe, otherwise mark it as being 

unsafe; 

endif 

update the aggregation functions of the group record; 
endloop 

/* Reached end of input file F */ 
delete input file F and remove it from FS; 
write every page containing only safe group records to the final 
output; 

write all other pages to their associated partition files; 
add all non-empty partition files to FS; 
clear all auxiliary data structures; 
endloop 
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The invention has been described in language more or less 
specific as to structural and methodical features. It is to be 
understood, however, that the invention is not limited to the 
specific features described, since the means herein disclosed 
comprise preferred forms of putting the invention into effect. 
The invention is, therefore, claimed in any of its forms or 
modifications within the proper scope of the appended 
claims appropriately interpreted in accordance with the 
doctrine of equivalents. 

I claim: 

1. A relational database system comprising: 
a non-volatile memory; 

a volatile memory for temporarily storing a set of data 
records, the volatile memory having an amount of 
space available for query processing; 

a query processor coupled to the non-volatile and volatile 
memories to process a query of the data records accord- 
ing to at least one query parameter, the query processor 
being configured to partition the data records into 
multiple buckets for query processing using the avail- 
able space in the volatile memory; and 

the query processor being configured to aggregate data 
records having like query parameters and to occasion- 
ally select ones of the data records for writing to the 
non-volatile memory to free up part of the available 
space to receive additional data records, the query 
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processor selecting the data records for writing to the 
non-volatile memory according to how likely the data 
records will aggregate, if left in the buckets, with the 
data records to be added. 

2. A relational database system as recited in claim 1, 
wherein the query processor partitions the data records using 
a hash partitioning process. 

3. A relational database system as recited in claim 1, 
wherein the query processor selects the data records having 
the least absorption rate for writing to the non-volatile 
memory. 

4. A relational database system as recited in claim 1, 
wherein the query processor selects the data records that 
have been least recently aggregated for writing to the 
non-volatile memory. 

5. A relational database system as recited in claim 1, 
wherein the query processor dynamically utilizes the avail- 
able space among different ones of the buckets without 
requiring a specific quantity of available space be reserved 
for any particular one of the buckets. 

6. A relational database system as recited in claim 1, 
wherein the query processor distinguishes between "unsafe" 
buckets from which the selected data records are written to 
the non-volatile memory and "safe" buckets that have not 
yet had data records written to the non-volatile memory. 

7. A relational database system as recited in claim 1, 
wherein the query processor selects and writes at least one 
data record to the non-volatile memory in an event that 
addition of a new data record causes an overflow condition 
indicating that the available space will be exceeded. 

8. A relational database system as recited in claim 1, 
wherein the available space is configured as a pool of 
memory pages. 

9. A relational database system as recited in claim 8, 
wherein the query processor initially allocates at least one 
memory page per bucket. 

10. A relational database system as recited in claim 8, 
wherein the query processor moves an empty memory page 
from one of the buckets to another of the buckets. 

11. A relational database system as recited in claim 8, 
wherein when the query processor selects a data record to 
write to the non-volatile memory, the query processor writes 
a memory page that contains the selected data record. 

12. A relational database system as recited in claim 8, 
wherein the query processor partitions the data records into 
memory pages associated with corresponding buckets, and 
in an event that addition of a new data record to one of the 
buckets causes an overflow condition indicating that all of 
the memory pages are full, the query processor selects the 
data records for writing to the non-volatile memory from 
any one of the buckets. 

13. A relational database system comprising: 
a non-volatile memory; 

a volatile memory for temporarily storing a set of data 
records, the volatile memory having an amount of 
space available for query processing; 

a query processor coupled to the non-volatile and volatile 
memories to process a query of the data records accord- 
ing to at least one query parameter, the query processor 
being configured to partition the data records into 
multiple buckets for query processing using the avail- 
able space in the volatile memory; and 

the query processor being configured to aggregate data 
records having like query parameters and to occasion- 
ally select ones of the data records for writing to the 
non-volatile memory to free up part of the available 
space to receive additional data records, the query 
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processor selecting the data records for writing to the in an event that no free memory pages are left to assign, 

non -volatile memory from any one of the buckets. the query processor selects a memory page from any 

14. A relational database system as recited in claim 13, one of the partitions, writes the data records on the 
wherein the query processor selects the data records for selected memory page to the non-volatile memory to 
writing to the non-volatile memory according to how likely 5 empt y the memory page, and moves the emptied 
the data records will aggregate, if left in the buckets, with the memory page to the particular partition, 
additional data records to be added. 24. A relational database system as recited in claim 23, 

15. A relational database system as recited in claim 13, whcrcin the p rocc ssor selects the memory page 

wherein the query processor selects the data records having accordin to crtoia ^ favors QU{ t y J* l{ 

the least absorption rate for writing to the non-volatile m „ & „ £11 , r * . 

memorv memory pages over partially filled memory pages and that 

16. A relational database system as recited in claim 13, favors memor y low ****** rate over 
wherein the query processor selects the data records that me ™7 P a S e . s ^ a ^ absorption rate. 

have been least recently aggregated for writing to the 25> A relational database system as recited m claim 23, 

non-volatile memory. ~ wherein the query processor moves empty memory pages 

17. A relational database system as recited in claim 13, 15 among the partitions to satisfy incoming data records, 
wherein the query processor distinguishes between first ones 26. A relational database system as recited in claim 23, 
of the buckets from which the selected data records are wherein the query processor transfers all memory pages 
written to the non-volatile memory and second ones of the from one of the partitions to others of the partitions, 
buckets that have not yet had data records written to the 27. A relational database system as recited in claim 23, 
non-volatile memory. 20 wherein the query processor distinguishes between first 

18. A relational database system as recited in claim 13, partitions from which one or more memory pages are written 
wherein the available space is configured as a pool of to the non-volatile memory and second partitions that have 
memory pages. no t yet had any associated memory page written to the 

19. A relational database system as recited in claim 18, non-volatile memory 

wherein the query processor initially allocates at least one 25 2 8. Arelational database computer program embodied on 

memory page per bucket. a computer-readable medium comprising: 

20. A relational database system as recited in claim 18, ... , ... , . t . , 
wherein the query processor dynamically moves empty partitioning code to partition data records into multiple 
memory pages among the buckets. partitions; 

21. A relational database system as recited in claim 18, aggregation code to aggregate within respective partitions 
wherein when the query processor selects a data record to like data records; and 

write to the non-volatile memory, the query processor writes victim selection code to select at least one data record to 

a memory page that contains the selected data record. write t0 nonvolatile memory to free space in the 

22. A relational database system comprising: partitions for future data records, the victim selection 
a volatile memory for temporarily storing a set of data code selecting the data record according to how likely 

records for query processing; ^ data rccord wiU aggregatCj tf left in its rcspcctivc 

a query processor coupled to the volatile memory to partition, with the future data records, 

process a query of the data records, the query processor 2 9. A relational database computer program as recited in 

being configured to partition the data records into claim 28, wherein the partitioning code partitions the data 

multiple partitions usmg a hash partitioning table, the records using a hash partitioning process, 

hash partitioning table having entries that reference the 30. a relational database computer program as recited in 

partitions; and 28, wherein the victim selection code selects the data 

the query processor being configured to create a second record with the least absorption rate. 

hash table separate from the partitioning table that has 31. A relational database computer program as recited in 

entries that reference individual data records within the 45 claim 28, wherein the victim selection code selects the data 

partitions, the separate hash table being used to lookup record that has been least recently aggregated, 

matching data records for aggregation. 32. A relational database system comprising: 

23. A relational database system comprising: a mcmor y. m $ 

a non-volatile memory, a processm g C0U plcd to the memory and configured 

a volatile memory for temporarily stormg a set of data 50 t0 execute the relationaJ database computer program as 

records, the volatile memory having an amount of redted in daim 28 

available space segmented into multiple memory 33 In a relationa i database system having data records 

P*^^ that are partitioned into partitions and aggregated within the 

a query processor coupled to the non-volatile and volatile partitions, a query processing program executable on the 

memories to process a series of data records, the query 55 relational database system to remove at least one data record 

processor being configured to partition the data records that has the least absorption rate from a partition, 

into multiple partitions and store the data records in one 34. A query processing program as recited in claim 33, 

or more memory pages associated with the partitions; further configured to remove the data record that is least 

as a new data record is processed into a particular recently aggregated, 

partition, the query processor being configured to alter- 60 35. A query processing program as recited in claim 33, 

nately aggregate the new data record with a like data further configured to remove the data record from any one 

record already existing in the particular partition or of the partitions. 

store the data record separately within the particular 36. Arelational database computer program embodied on 
partition; a computer-readable medium comprising the query process- 
in an event that a memory page of the particular partition 65 ing program as recited in claim 33. 
becomes full, the query processor assigns an additional 37. In a relational database system having data records 
memory page to the particular partition; and that are partitioned into partitions and aggregated within the 
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partitions, a query processing program executable on the 
relational database system to add data records to respective 
partitions until an overflow condition in a particular partition 
is reached indicating that that memory space available for 
the partitions is full, the query processing program removing 
at least one data record from any one of the partitions to free 
memory space to receive future data records. 

38. A query processing program as recited in claim 37, 
further configured to remove the data record that has the 
least absorption rate. 

39. A query processing program as recited in claim 37, 
further configured to remove the data record that is least 
recently aggregated. 

40. A relational database computer program embodied on 
a computer-readable medium comprising the query process- 
ing program as recited in claim 37. 

41. In a relational database system having multiple 
memory pages for holding data records that are separated 
into predefined partitions, a query processing program 
executable on the relational database system to move empty 
memory pages among the partitions. 

42. A query processing program as recited in claim 41, 
further configured to aggregate the data records within the 
partitions wherein the partitions having the data records with 
the highest absorption rates tend to have associated more 
memory pages that the partitions having the data records 
with the lowest absorption rates. 

43. A relational database computer program embodied on 
a computer-readable medium comprising the query process- 
ing program as recited in claim 41. 

44. A method for processing a query, comprising the 
following steps: 

partitioning data records into multiple partitions; 
aggregating within respective partitions like data records; 
and 

selecting at least one data record to write to non-volatile 
memory to free space in the partitions for future data 
records according to how likely that data record will 
aggregate, if left in its respective partition, with the 
future data records. 

45. A method as recited in claim 44, wherein the parti- 
tioning step comprises the step of partitioning the data 
records using a hash partitioning process. 

46. A method as recited in claim 44, wherein the aggre- 
gating step comprises the step of hashing each new data 
record to an entry in a hash table which references any like 
data records in the partitions. 

47. A method as recited in claim 44, wherein the selecting 
step comprises the step of selecting the data record with the 
least absorption rate. 
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48. A method as recited in claim 44, wherein the selecting 
step comprises the step of selecting the data record that has 
been least recently aggregated. 

49. A computer programmed to perform the steps in the 
method as recited in claim 44. 

50. A computer-readable medium having computer- 
executable instructions for performing the steps in the 
method as recited in claim 44. 

51. In a relational database system having multiple 
memory pages for holding data records that are separated 
into predefined partitions, a method for processing a query 
of the data records, comprising the following steps: 

partitioning a new data record into one of the partitions; 

if a like data record already exists in the partition, 
aggregating the new data record with the like data 
record and otherwise storing the data record separately 
within the partition; 

in an event that a memory page of the partition is full and 
cannot accept the new data entry, adding a next 
memory page to the partition; and 

in an event that no free memory pages are left to add, 
selecting a memory page from any one of the partitions 
according to a selection criteria that favors output of 
full memory pages over partially filled memory pages 
and that favors memory pages with low absorption 
rates and writing the data records on the selected 
memory page to non-volatile memory to empty the 
memory page. 

52. A method as recited in claim 51, wherein the parti- 
tioning step comprises the step of partitioning the new data 
record using a hash partitioning process. 

53. A method as recited in claim 51, further comprising 
the step of moving the empty memory page to another 
partition to satisfy incoming data records. 

54. A method as recited in claim 51, further comprising 
the step of transferring all memory pages from one of the 
partitions to others of the partitions. 

55. A method as recited in claim 51, further comprising 
distinguishing between first partitions from which one or 
more memory pages are written to the non-volatile memory 
and second partitions that have not yet had any associated 
memory page written to the non-volatile memory. 

56. A computer programmed to perform the steps in the 
method as recited in claim 51. 

57. A computer-readable medium having computer- 
executable instructions for performing the steps in the 
method as recited in claim 51. 
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